Better Data


More Data or Better Data? A Critical Analysis of Data Selection and Synthesis for Mathematical Reasoning

Zhao, Yike, Guo, Simin, Yang, Ziqing, Han, Shifan, Lin, Dahua, Tan, Fei

arXiv.org Artificial Intelligence

The reasoning capabilities of Large Language Models (LLMs) play a critical role in many downstream tasks, yet depend strongly on the quality of training data. Despite various proposed data construction methods, their practical utility in real-world pipelines remains underexplored. In this work, we conduct a comprehensive analysis of open-source datasets and data synthesis techniques for mathematical reasoning, evaluating them under a unified pipeline designed to mirror training and deployment scenarios. We further distill effective data selection strategies and identify practical methods suitable for industrial applications. Our findings highlight that structuring data in more interpretable formats or distilling from stronger models often outweighs simply scaling up data volume. This study provides actionable guidance for integrating training data to enhance LLM capabilities, supporting both cost-effective data curation and scalable model enhancement. We hope this work will inspire further research on how to balance "more data" versus "better data" for real-world reasoning tasks.


There's no Data Like Better Data: Using QE Metrics for MT Data Filtering

Peter, Jan-Thorsten, Vilar, David, Deutsch, Daniel, Finkelstein, Mara, Juraska, Juraj, Freitag, Markus

arXiv.org Artificial Intelligence

Quality Estimation (QE), the evaluation of machine translation output without the need for explicit references, has seen major improvements in recent years with the use of neural metrics. In this paper we analyze the viability of using QE metrics to filter out bad-quality sentence pairs from the training data of neural machine translation (NMT) systems. While most corpus filtering methods focus on detecting noisy examples in collections of texts, usually huge amounts of web-crawled data, QE models are trained to discriminate more fine-grained quality differences. We show that by selecting the highest-quality sentence pairs in the training data, we can improve translation quality while reducing the training set size by half. We also provide a detailed analysis of the filtering results, which highlights the differences between the two approaches.
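The filtering idea described above can be sketched in a few lines: score every sentence pair with a QE metric and keep only the top-scoring half. This is a minimal illustration, not the paper's implementation; `qe_score` is a hypothetical placeholder for a real neural QE model, and the toy length-ratio scorer below exists only to make the sketch runnable.

```python
def filter_by_qe(pairs, qe_score, keep_fraction=0.5):
    """Keep the highest-scoring fraction of sentence pairs.

    pairs: list of (source, target) tuples
    qe_score: callable (source, target) -> float, higher = better quality
    keep_fraction: share of the corpus to retain (0.5 = half, as in the paper)
    """
    # Sort pairs from best to worst according to the QE metric.
    scored = sorted(pairs, key=lambda p: qe_score(*p), reverse=True)
    cutoff = max(1, int(len(scored) * keep_fraction))
    return scored[:cutoff]


# Toy usage with a crude proxy scorer (length ratio); a real pipeline
# would plug in a trained QE model here instead.
corpus = [("hello world", "hallo welt"), ("good morning", "xx")]
crude = lambda s, t: min(len(s), len(t)) / max(len(s), len(t))
kept = filter_by_qe(corpus, crude)
```

With the toy scorer, the badly mismatched pair is dropped and only the plausible translation survives, halving the corpus exactly as the paper's setup does.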


AI expert Meredith Broussard: 'Racism, sexism and ableism are systemic problems'

The Guardian

Meredith Broussard is a data journalist and academic whose research focuses on bias in artificial intelligence (AI). She has been in the vanguard of raising awareness and sounding the alarm about unchecked AI. Her previous book, Artificial Unintelligence (2018), coined the term "technochauvinism" to describe the blind belief in the superiority of tech solutions to solve our problems. She appeared in the Netflix documentary Coded Bias (2020), which explores how algorithms encode and propagate discrimination. Her new book is More Than a Glitch: Confronting Race, Gender and Ability Bias in Tech.


Hippo Insurance CTO insurtech predictions for 2023

#artificialintelligence

As we welcome the new year, it's natural to reflect on the year that passed and look ahead to the challenges and opportunities to come, and more specifically to how new technologies might impact the insurance industry. As always, we must separate the signal from the noise. For many, artificial intelligence is a perennial buzzword, but paradoxically, the technology appears largely still in its infancy in the insurance industry, especially in the home insurance space. Regulators and insurers alike are understandably grappling with the lack of model explainability, which presents challenges for the widespread use of AI to directly evaluate and price risk for homeowners insurance in the near future. Instead, major technological innovation in homeowners insurance in the coming year will likely come from solutions and tools designed to improve the ingestion and processing of data in ways that positively impact the consumer experience throughout their homeownership journey.


Big Tech builds AI with bad data. So scientists sought better data.

#artificialintelligence

Yacine Jernite's fears about bias in artificial intelligence were vividly affirmed in 2017, when a Facebook translation error led Israeli police to arrest a Palestinian construction worker. The man had posted a picture of himself leaning against a bulldozer with the caption, in Arabic, "good morning." Facebook mistakenly translated it, in Hebrew, as "attack them." The error was quickly discovered and the man released, according to a report in Haaretz, but the incident cemented personal concerns about AI for Jernite, who joined Facebook's AI division soon after. As the child of Moroccan parents in post-9/11 America, Jernite said he has "spent hours upon hours in immigration secondary interviews -- in a way that I could not at the time trace to the technology that was being applied."


Data Centric Artificial Intelligence

#artificialintelligence

Data-centric artificial intelligence is a modern approach to building AI systems around quality data. Data-centric AI prioritizes the quality of data over the quantity of data, while traditional model-centric AI does the opposite. The key is better data, not big data! The core idea of data-centric AI is to handle data the way one handles high-quality materials when building a house, i.e., spending relatively more time labelling, augmenting, managing, and curating the data. The traditional way is to optimize highly parameterized models on big data to achieve high performance.
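A minimal sketch of what "spending more time curating the data" can mean in practice, under the simplifying assumption that quality problems show up as duplicates and invalid labels. The `curate` function and its rules are illustrative, not a prescribed data-centric AI pipeline.

```python
def curate(examples, valid_labels):
    """Clean a labelled dataset before training.

    examples: list of (text, label) pairs
    valid_labels: set of labels considered legal for the task
    Returns a deduplicated list with only validly labelled examples.
    """
    seen = set()
    cleaned = []
    for text, label in examples:
        key = text.strip().lower()
        if key in seen:                # drop exact duplicate texts
            continue
        if label not in valid_labels:  # drop examples with invalid labels
            continue
        seen.add(key)
        cleaned.append((text.strip(), label))
    return cleaned


# Toy usage: one duplicate and one mislabelled example are removed.
data = [("Cat photo", "cat"), ("cat photo", "cat"), ("Dog photo", "fish")]
cleaned = curate(data, {"cat", "dog"})
```

Real data-centric workflows go further (label audits, annotator agreement checks, targeted augmentation), but the principle is the same: improve the dataset before reaching for a bigger model.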


Big Tech builds AI with bad data. So scientists sought better data.

#artificialintelligence

Now Jernite, 33, is trying to push AI in a better direction. After leaving Facebook, he joined BigScience, a global effort by 1,000 researchers in 60 countries to build a more transparent, accountable AI, with less of the bias that infects so many Big Tech initiatives. The largely volunteer effort trained a computer system with good data that was curated by humans from different cultures, rather than readily available data scraped from the internet, written mostly in English, and riddled with harmful speech on race, gender and religion. The resulting AI was released on July 12 for researchers to download and study.



Every Business Can Work More Efficiently With Better Data

#artificialintelligence

This 10-course bundle can get you up to speed on today's top data technologies and will help you better utilize data in your own business's day-to-day operations. The courses are all taught by Zenva Academy (4.4/5-star instructor rating), one of the premier online learning destinations. While the bundle covers a number of technologies, it gives special emphasis to Python. Python is the world's most popular programming language because it is a general-purpose language that focuses on readability and extensibility. Because it's so flexible, it's used in everything from bulk mathematical calculation to web and mobile backends to machine learning. It's an essential tool to learn if you want to work with massive amounts of data, and this bundle will introduce you to Python before giving you practical instruction in working with Python to read data from APIs, process images, work with Python Turtle, visualize data in many ways, build a game, and more.


MIT Researcher Explores The Downside Of Machine Learning In Healthcare - Liwaiwai

#artificialintelligence

While working toward her dissertation in Computer Science, Marzyeh Ghassemi PhD '17 wrote some papers on how machine learning techniques from AI could be applied to clinical data in order to predict patient outcomes. "It wasn't until the end of my PhD work that one of my committee members asked: 'Did you ever check to see how well your model worked across different groups of people?'" That question was eye-opening for Ghassemi, who had previously assessed the performance of models in aggregate, across all patients. Upon a closer look, she saw that models often worked differently, and specifically worse, for minority groups such as Black women, a revelation that took her by surprise. "I hadn't made the connection beforehand that health disparities would translate directly to model disparities," she says. "And given that I am a visible minority woman-identifying computer scientist at MIT, I am reasonably certain that many others weren't aware of this either."